Abstract
Here you can find how the processing and plotting related to our B cell receptor clonotype analysis was performed for the paper entitled “Multivalent Antigen Display on Nanoparticles Elicits Diverse and Broad B cell Responses”This Rmarkdown provides the analysis for the processed
data in the paper entitled “Multivalent Antigen Display on Nanoparticles
Diversifies B Cell Responses”. It includes the B cell receptors
analysis, plots, and generation of intermediate files for plotting using
other programs. The processed files used here are AIRR-compliant
datasets which can be downloaded from Zenodo (DOI: 00000000), they are
either Human Respiratory Syncytial Virus-specific (RSV-specific) or the
full bulk repertoire from the vaccinated non-human primates (Macaca
mulatta).
Loaded libraries need to be present.
library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)
Some data used here is stored as .rds format, but it can
also be found as .tsv at Zenodo (DOI: 000000000).
data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
mutate(group = case_when(group == "NP 100%" ~ "20-mer",
group == "NP 50%" ~ "10-mer",
group == "Sol" ~ "1-mer",
group == "PostF" ~ "PostF"))
clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")
# replace column names for AIRR-compliant names and filter out "PV" timepoint which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
filter(timepoint != "PV") %>%
mutate(cdr3_aa_length = nchar(CDR3_aa),
cdr3_aa = CDR3_aa,
v_call = V_gene,
j_call = J_gene,
d_call = D_gene)
# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))
# load mAbs data
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")
# load competition ELISA data
# edited the raw values to have the max value as the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-04-28_LOR_norm-dAUC.csv", header = T, row.names = 1, encoding="UTF-8") %>%
replace(is.na(.), 0) %>%
mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF", "w.b.")),
epitope = factor(epitope, levels = c("Ø","V", "IV","III", "II","I", "foldon", "UK-Pre","UK-DP","PostF", "WB")))
# load light chains information clonotyped
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE)
The non-human primates plasma samples from this study were used for antibody competition ELISAs coated with pre-fusion or post-fusion proteins against previously characterized monoclonal antibodies. Here we show this data using different visualization methods.
Multidimensional scaling aims to conserve distances between datapoints and/or samples from a set of variables. Thus, closer a point is to each other, closer their competition profile in this case. It is a good way to summarize in a 2D-space a large number of variables. The MDS input is a dissimilarity matrix, for this plot the input the matrix is based on cosine distance. Although the euclidean distance is the most used, we have decided to use the cosine distance here because the magnitude of responses are less important than the competition profile itself for us. Some NHPs were really good responders, thus having higher titers for all the competitions, thus euclidean distance would drive them far away from all the other NHPs because of their high antibody titer. Since cosine distance takes into account the distance as an angle rather than the value itself, it does not take into account the weight or the magnitude of the antibody titers.
cosine_distance <- function(x) {
#For cosine similarity matrix
Matrix <- x %>%
select(-c(timepoints, ID, group)) %>%
as.matrix()
sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- sim %*% t(sim)
#Convert to cosine dissimilarity matrix (distance matrix).
D_sim <- as.dist(1 - sim)
mds <- D_sim %>%
cmdscale(3) %>%
as_tibble()
colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")
mds <- mds %>%
mutate(group = x$group,
timepoints = x$timepoints,
ID = x$ID)
return(mds)
}
The code below plots 4 different MDS plots. They are divided between
2 weeks after Boost 1 (B1) and Boost 2 (B2), with or without animals
vaccinated with PostF immunogen. On the top of each plot, it is written
B1 or B2, and legends say if PostF animals are included or not. The
cosine distance was calculated separately using the custom
cosine_distance() provided above, for that reason, position
of points should not be compared directly between each plot but rather
with points within each plot.
# calculate cosine distance and generate MDS with and without PostF
mds_b1 <- data_comp %>% filter(timepoints == "B1") %>% cosine_distance()
mds_b2 <- data_comp %>% filter(timepoints == "B2") %>% cosine_distance()
mds_no_postf_b1 <- data_comp %>% filter(group != "PostF", timepoints == "B1") %>% cosine_distance()
mds_no_postf_b2 <- data_comp %>% filter(group != "PostF", timepoints == "B2") %>% cosine_distance()
ls_mds <- list(mds_b1, mds_b2, mds_no_postf_b1, mds_no_postf_b2)
plot_list <- list()
for(f in seq_along(ls_mds)){
# Plot and color by groups
mds_plot <- ls_mds[[f]]
if(f %in% c(2,4)){
mds_plot$Dim.1 <- mds_plot$Dim.1 * -1 # flip axis in first dimension
}
plot <- mds_plot %>%
ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) +
geom_point(size = 4, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
lims(x = c(-.5,.5), y = c(-.4, .4)) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
legend.position = c(.2,.2),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
facet_wrap(~timepoints)
plot_list[[f]] <- plot
ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 4, height = 4)
}
ggarrange(plot_list[[3]], plot_list[[4]], plot_list[[1]], plot_list[[2]], ncol = 2, nrow = 2, common.legend = TRUE,legend = "top", legend.grob = get_legend(plot_list[[2]] + theme(legend.position = "top")))
data_comp_longer <- data_comp %>%
pivot_longer(cols = 2:9, names_to = "mAb", values_to = "ELISA_competition") %>%
mutate(epitopes = plyr::mapvalues(mAb,
from = c("D25.PreF", "MPE8.PreF", "ADI.PreF", "Pali.PreF", "X101F.PreF",
"ADI.PostF", "X101F.PostF", "Pali.PostF"),
to = c("Ø", "III", "I", "II", "IV",
"I", "IV", "II")),
epitopes = factor(epitopes, levels = c("Ø", "III", "IV", "II", "I")),
conformation = factor(case_when(grepl("PostF", mAb) ~ "PostF",
grepl("PreF", mAb) ~ "PreF"), levels = c("PreF", "PostF")),
timepoint_group = paste(timepoints, group, sep = "_"),
timepoint_group_epitope = paste(timepoints, group, epitopes, sep = "_"))
my_comparisons_sites <- list(c("B1_1-mer", "B1_20-mer"),
c("B1_1-mer", "B1_10-mer"),
c("B1_1-mer", "B1_PostF"),
c("B1_10-mer", "B1_20-mer"),
c("B1_10-mer", "B1_PostF"),
c("B1_20-mer", "B1_PostF"),
c("B2_1-mer", "B2_20-mer"),
c("B2_1-mer", "B2_10-mer"),
c("B2_1-mer", "B2_PostF"),
c("B2_10-mer", "B2_20-mer"),
c("B2_10-mer", "B2_PostF"),
c("B2_20-mer", "B2_PostF"))
my_comparisons_1 <- lapply(my_comparisons_sites, paste0, "_I")
my_comparisons_2 <- lapply(my_comparisons_sites, paste0, "_II")
my_comparisons_3 <- lapply(my_comparisons_sites, paste0, "_III")
my_comparisons_4 <- lapply(my_comparisons_sites, paste0, "_IV")
my_comparisons_5 <- lapply(my_comparisons_sites, paste0, "_Ø")
# Selecting sites for statistical comparison only between groups for each timepoint
# removed statistical comparisons for Boost 1 site I in PreF since most values were below the threshold of detection
my_comparisons_preF <- c(my_comparisons_1[7:12], my_comparisons_2, my_comparisons_3, my_comparisons_4, my_comparisons_5)
my_comparisons_postF <- c(my_comparisons_1, my_comparisons_2, my_comparisons_4)
stat.test <- data_comp_longer %>%
filter(conformation == "PreF") %>%
wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope,
paired = FALSE,
p.adjust.method = "fdr",
comparisons = my_comparisons_preF)
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|
| ELISA_competition | B2_1-mer_I | B2_20-mer_I | 4 | 5 | 14.0 | 0.413 | 0.812 | ns |
| ELISA_competition | B2_1-mer_I | B2_10-mer_I | 4 | 5 | 20.0 | 0.020 | 0.176 | ns |
| ELISA_competition | B2_1-mer_I | B2_PostF_I | 4 | 4 | 0.0 | 0.029 | 0.190 | ns |
| ELISA_competition | B2_10-mer_I | B2_20-mer_I | 5 | 5 | 6.0 | 0.205 | 0.527 | ns |
| ELISA_competition | B2_10-mer_I | B2_PostF_I | 5 | 4 | 0.0 | 0.020 | 0.176 | ns |
| ELISA_competition | B2_20-mer_I | B2_PostF_I | 5 | 4 | 2.0 | 0.064 | 0.343 | ns |
| ELISA_competition | B1_1-mer_II | B1_20-mer_II | 4 | 5 | 3.0 | 0.111 | 0.428 | ns |
| ELISA_competition | B1_1-mer_II | B1_10-mer_II | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B1_1-mer_II | B1_PostF_II | 4 | 4 | 12.0 | 0.343 | 0.772 | ns |
| ELISA_competition | B1_10-mer_II | B1_20-mer_II | 5 | 5 | 6.0 | 0.222 | 0.545 | ns |
| ELISA_competition | B1_10-mer_II | B1_PostF_II | 5 | 4 | 16.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B1_20-mer_II | B1_PostF_II | 5 | 4 | 19.0 | 0.032 | 0.190 | ns |
| ELISA_competition | B2_1-mer_II | B2_20-mer_II | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_II | B2_10-mer_II | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_II | B2_PostF_II | 4 | 4 | 6.0 | 0.686 | 0.939 | ns |
| ELISA_competition | B2_10-mer_II | B2_20-mer_II | 5 | 5 | 9.0 | 0.548 | 0.883 | ns |
| ELISA_competition | B2_10-mer_II | B2_PostF_II | 5 | 4 | 6.0 | 0.413 | 0.812 | ns |
| ELISA_competition | B2_20-mer_II | B2_PostF_II | 5 | 4 | 9.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B1_1-mer_III | B1_20-mer_III | 4 | 5 | 7.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B1_1-mer_III | B1_10-mer_III | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B1_1-mer_III | B1_PostF_III | 4 | 4 | 13.5 | 0.124 | 0.446 | ns |
| ELISA_competition | B1_10-mer_III | B1_20-mer_III | 5 | 5 | 13.0 | 1.000 | 1.000 | ns |
| ELISA_competition | B1_10-mer_III | B1_PostF_III | 5 | 4 | 20.0 | 0.018 | 0.176 | ns |
| ELISA_competition | B1_20-mer_III | B1_PostF_III | 5 | 4 | 20.0 | 0.018 | 0.176 | ns |
| ELISA_competition | B2_1-mer_III | B2_20-mer_III | 4 | 5 | 4.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B2_1-mer_III | B2_10-mer_III | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_III | B2_PostF_III | 4 | 4 | 11.0 | 0.486 | 0.883 | ns |
| ELISA_competition | B2_10-mer_III | B2_20-mer_III | 5 | 5 | 8.0 | 0.421 | 0.812 | ns |
| ELISA_competition | B2_10-mer_III | B2_PostF_III | 5 | 4 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_20-mer_III | B2_PostF_III | 5 | 4 | 16.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B1_1-mer_IV | B1_20-mer_IV | 4 | 5 | 4.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B1_1-mer_IV | B1_10-mer_IV | 4 | 5 | 6.0 | 0.413 | 0.812 | ns |
| ELISA_competition | B1_1-mer_IV | B1_PostF_IV | 4 | 4 | 8.0 | 1.000 | 1.000 | ns |
| ELISA_competition | B1_10-mer_IV | B1_20-mer_IV | 5 | 5 | 11.0 | 0.841 | 0.958 | ns |
| ELISA_competition | B1_10-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B1_20-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B2_1-mer_IV | B2_20-mer_IV | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B2_1-mer_IV | B2_10-mer_IV | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_IV | B2_PostF_IV | 4 | 4 | 7.0 | 0.886 | 0.958 | ns |
| ELISA_competition | B2_10-mer_IV | B2_20-mer_IV | 5 | 5 | 4.0 | 0.095 | 0.428 | ns |
| ELISA_competition | B2_10-mer_IV | B2_PostF_IV | 5 | 4 | 5.0 | 0.286 | 0.671 | ns |
| ELISA_competition | B2_20-mer_IV | B2_PostF_IV | 5 | 4 | 12.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B1_1-mer_Ø | B1_20-mer_Ø | 4 | 5 | 11.0 | 0.898 | 0.958 | ns |
| ELISA_competition | B1_1-mer_Ø | B1_10-mer_Ø | 4 | 5 | 12.0 | 0.701 | 0.939 | ns |
| ELISA_competition | B1_1-mer_Ø | B1_PostF_Ø | 4 | 4 | 12.0 | 0.186 | 0.513 | ns |
| ELISA_competition | B1_10-mer_Ø | B1_20-mer_Ø | 5 | 5 | 10.0 | 0.666 | 0.939 | ns |
| ELISA_competition | B1_10-mer_Ø | B1_PostF_Ø | 5 | 4 | 16.0 | 0.109 | 0.428 | ns |
| ELISA_competition | B1_20-mer_Ø | B1_PostF_Ø | 5 | 4 | 16.0 | 0.109 | 0.428 | ns |
| ELISA_competition | B2_1-mer_Ø | B2_20-mer_Ø | 4 | 5 | 13.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B2_1-mer_Ø | B2_10-mer_Ø | 4 | 5 | 12.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B2_1-mer_Ø | B2_PostF_Ø | 4 | 4 | 16.0 | 0.029 | 0.190 | ns |
| ELISA_competition | B2_10-mer_Ø | B2_20-mer_Ø | 5 | 5 | 13.0 | 1.000 | 1.000 | ns |
| ELISA_competition | B2_10-mer_Ø | B2_PostF_Ø | 5 | 4 | 20.0 | 0.020 | 0.176 | ns |
| ELISA_competition | B2_20-mer_Ø | B2_PostF_Ø | 5 | 4 | 20.0 | 0.020 | 0.176 | ns |
data_comp_longer %>%
filter(conformation == "PreF") %>%
ggplot(aes(x = timepoint_group,
y = ELISA_competition,
color = group,
fill = group)) +
geom_point(size = 2, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
scale_y_log10() +
stat_summary(fun.y = mean,
geom = "crossbar",
color = "black") +
geom_hline(yintercept = 10, linetype = "dashed") +
labs(title = "PreF Antigenic Sites", y = "50% Competition Titer", x = "") +
theme(axis.ticks = element_line(size = .5),
legend.background = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
facet_wrap(~epitopes, nrow = 1)
stat.test <- data_comp_longer %>%
filter(conformation == "PostF") %>%
wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope,
paired = FALSE,
p.adjust.method = "fdr",
comparisons = my_comparisons_postF) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ELISA_competition | B1_1-mer_I | B1_20-mer_I | 4 | 5 | 0.0 | 0.020 | 0.088 | ns | 6533.097 | B1_1-mer…. | 1 | 7 |
| ELISA_competition | B1_1-mer_I | B1_10-mer_I | 4 | 5 | 0.0 | 0.020 | 0.088 | ns | 7250.111 | B1_1-mer…. | 1 | 4 |
| ELISA_competition | B1_1-mer_I | B1_PostF_I | 4 | 4 | 1.0 | 0.059 | 0.127 | ns | 7967.125 | B1_1-mer…. | 1 | 10 |
| ELISA_competition | B1_10-mer_I | B1_20-mer_I | 5 | 5 | 3.0 | 0.056 | 0.127 | ns | 8684.139 | B1_10-me…. | 4 | 7 |
| ELISA_competition | B1_10-mer_I | B1_PostF_I | 5 | 4 | 9.0 | 0.905 | 0.905 | ns | 9401.153 | B1_10-me…. | 4 | 10 |
| ELISA_competition | B1_20-mer_I | B1_PostF_I | 5 | 4 | 12.0 | 0.730 | 0.773 | ns | 10118.168 | B1_20-me…. | 7 | 10 |
| ELISA_competition | B2_1-mer_I | B2_20-mer_I | 4 | 5 | 0.0 | 0.016 | 0.088 | ns | 10835.182 | B2_1-mer…. | 13 | 19 |
| ELISA_competition | B2_1-mer_I | B2_10-mer_I | 4 | 5 | 5.0 | 0.286 | 0.381 | ns | 11552.196 | B2_1-mer…. | 13 | 16 |
| ELISA_competition | B2_1-mer_I | B2_PostF_I | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 12269.210 | B2_1-mer…. | 13 | 22 |
| ELISA_competition | B2_10-mer_I | B2_20-mer_I | 5 | 5 | 7.0 | 0.310 | 0.385 | ns | 12986.224 | B2_10-me…. | 16 | 19 |
| ELISA_competition | B2_10-mer_I | B2_PostF_I | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 13703.238 | B2_10-me…. | 16 | 22 |
| ELISA_competition | B2_20-mer_I | B2_PostF_I | 5 | 4 | 3.0 | 0.111 | 0.195 | ns | 14420.252 | B2_20-me…. | 19 | 22 |
| ELISA_competition | B1_1-mer_II | B1_20-mer_II | 4 | 5 | 3.5 | 0.125 | 0.205 | ns | 15137.266 | B1_1-mer…. | 2 | 8 |
| ELISA_competition | B1_1-mer_II | B1_10-mer_II | 4 | 5 | 8.5 | 0.771 | 0.793 | ns | 15854.280 | B1_1-mer…. | 2 | 5 |
| ELISA_competition | B1_1-mer_II | B1_PostF_II | 4 | 4 | 0.0 | 0.026 | 0.088 | ns | 16571.294 | B1_1-mer…. | 2 | 11 |
| ELISA_competition | B1_10-mer_II | B1_20-mer_II | 5 | 5 | 6.5 | 0.236 | 0.327 | ns | 17288.309 | B1_10-me…. | 5 | 8 |
| ELISA_competition | B1_10-mer_II | B1_PostF_II | 5 | 4 | 1.0 | 0.034 | 0.088 | ns | 18005.323 | B1_10-me…. | 5 | 11 |
| ELISA_competition | B1_20-mer_II | B1_PostF_II | 5 | 4 | 2.0 | 0.064 | 0.127 | ns | 18722.337 | B1_20-me…. | 8 | 11 |
| ELISA_competition | B2_1-mer_II | B2_20-mer_II | 4 | 5 | 1.0 | 0.032 | 0.088 | ns | 19439.351 | B2_1-mer…. | 14 | 20 |
| ELISA_competition | B2_1-mer_II | B2_10-mer_II | 4 | 5 | 4.0 | 0.190 | 0.297 | ns | 20156.365 | B2_1-mer…. | 14 | 17 |
| ELISA_competition | B2_1-mer_II | B2_PostF_II | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 20873.379 | B2_1-mer…. | 14 | 23 |
| ELISA_competition | B2_10-mer_II | B2_20-mer_II | 5 | 5 | 8.0 | 0.421 | 0.474 | ns | 21590.393 | B2_10-me…. | 17 | 20 |
| ELISA_competition | B2_10-mer_II | B2_PostF_II | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 22307.407 | B2_10-me…. | 17 | 23 |
| ELISA_competition | B2_20-mer_II | B2_PostF_II | 5 | 4 | 2.0 | 0.064 | 0.127 | ns | 23024.421 | B2_20-me…. | 20 | 23 |
| ELISA_competition | B1_1-mer_IV | B1_20-mer_IV | 4 | 5 | 1.0 | 0.032 | 0.088 | ns | 23741.435 | B1_1-mer…. | 3 | 9 |
| ELISA_competition | B1_1-mer_IV | B1_10-mer_IV | 4 | 5 | 4.5 | 0.219 | 0.320 | ns | 24458.449 | B1_1-mer…. | 3 | 6 |
| ELISA_competition | B1_1-mer_IV | B1_PostF_IV | 4 | 4 | 2.0 | 0.114 | 0.195 | ns | 25175.464 | B1_1-mer…. | 3 | 12 |
| ELISA_competition | B1_10-mer_IV | B1_20-mer_IV | 5 | 5 | 6.0 | 0.222 | 0.320 | ns | 25892.478 | B1_10-me…. | 6 | 9 |
| ELISA_competition | B1_10-mer_IV | B1_PostF_IV | 5 | 4 | 6.0 | 0.413 | 0.474 | ns | 26609.492 | B1_10-me…. | 6 | 12 |
| ELISA_competition | B1_20-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.607 | ns | 27326.506 | B1_20-me…. | 9 | 12 |
| ELISA_competition | B2_1-mer_IV | B2_20-mer_IV | 4 | 5 | 0.0 | 0.016 | 0.088 | ns | 28043.520 | B2_1-mer…. | 15 | 21 |
| ELISA_competition | B2_1-mer_IV | B2_10-mer_IV | 4 | 5 | 6.0 | 0.413 | 0.474 | ns | 28760.534 | B2_1-mer…. | 15 | 18 |
| ELISA_competition | B2_1-mer_IV | B2_PostF_IV | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 29477.548 | B2_1-mer…. | 15 | 24 |
| ELISA_competition | B2_10-mer_IV | B2_20-mer_IV | 5 | 5 | 7.0 | 0.310 | 0.385 | ns | 30194.562 | B2_10-me…. | 18 | 21 |
| ELISA_competition | B2_10-mer_IV | B2_PostF_IV | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 30911.576 | B2_10-me…. | 18 | 24 |
| ELISA_competition | B2_20-mer_IV | B2_PostF_IV | 5 | 4 | 3.0 | 0.111 | 0.195 | ns | 31628.590 | B2_20-me…. | 21 | 24 |
data_comp_longer %>%
filter(conformation == "PostF") %>%
ggplot(aes(x = timepoint_group,
y = ELISA_competition,
color = group,
fill = group)) +
geom_point(size = 2, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
scale_y_log10() +
stat_summary(fun.y = mean,
geom = "crossbar",
color = "black") +
geom_hline(yintercept = 10, linetype = "dashed") +
labs(title = "PostF Antigenic Sites", y = "50% Competition Titer", x = "") +
theme(axis.ticks = element_line(size = .5),
legend.background = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
facet_wrap(~epitopes, nrow = 1)
Save table containing read information from the high-throughput sequencing. It includes number of raw reads per animal and number of processed reads with high quality assignment of HV and HJ genes using IgDiscover software as an IgBlast wrapper.
rep_seq_stat <- lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <- lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE)
rep_seq_stat <- rep_seq_stat %>%
select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))
write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )
After merging the antigen-specific sequences and bulk sequencing, we
run the the clonotype module from IgDiscover to assign
clonotype grouping to each sequence. We have done that for both the
individualized germline and the KIMDB database. Here we are reading
those datasets
ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls) & grepl("individualized|nestor-rm", ls) ]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
mutate(
Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
ID = gsub("_.*", "", ID_timepoint),
database = plyr::mapvalues(.id,
from = c("nestor-rm", "individualized"),
to = c("KIMDB", "Individualized")
),
database = factor(database, levels = c("KIMDB", "Individualized")),
Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
v_call = ifelse(is.na(v_call), V_gene, v_call),
j_call = ifelse(is.na(j_call), J_gene, j_call)
) %>%
group_by(.id, ID_timepoint) %>%
distinct(new_name, .keep_all = TRUE) %>%
ungroup() %>%
filter(database %in% c("KIMDB", "Individualized"))
Plotting sequencing depth for each timepoint per animal. This indicates that the sequencing depth for Boost 1 was lower, thus for that reason all the analysis were normalized by sequencing depth to take that into account.
# samples from BR2-B2 from E17 were merged with TR1-B2
# this was done due to low sequencing depth for that animal and large expansion of uncharacterized clones
full_rep_indiv <- read.table("../data/processed_data/summary_sequencing_table.tsv",
header = TRUE) %>%
separate(ID, sep = "_", into = c("sample", "ID")) %>%
filter(sample != "TR2-B2") %>%
mutate(sample = ifelse(sample == "BR2-B2", "TR1-B2", sample),
timepoint = case_when(sample == "igm" ~ "PV",
grepl("B1", sample) ~ "B1",
grepl("B2", sample) ~ "B2"),
timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
group_by(timepoint, ID) %>%
summarise(across(where(is.numeric), sum)) %>%
ungroup()
# since here we wanted to compare all groups, we used Dunn's test
stat.test <- full_rep_indiv %>%
dunn_test(formula = assignment_filtering.has_cdr3 ~ timepoint,
p.adjust.method = "fdr") %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| assignment_filtering.has_cdr3 | PV | B1 | 9 | 9 | -3.6525703 | 0.0002596 | 0.0007789 | *** | 1393265 | PV, B1 | 1 | 2 |
| assignment_filtering.has_cdr3 | PV | B2 | 9 | 9 | -0.2672612 | 0.7892680 | 0.7892680 | ns | 1491771 | PV, B2 | 1 | 3 |
| assignment_filtering.has_cdr3 | B1 | B2 | 9 | 9 | 3.3853091 | 0.0007110 | 0.0010665 | ** | 1590276 | B1, B2 | 2 | 3 |
full_rep_indiv %>%
ggdotplot(y = "assignment_filtering.has_cdr3", x = "timepoint",
group = "timepoint",
fill = "ID",
size = 1,) +
geom_boxplot(alpha = .2, outlier.shape = NA) +
ggpubr::stat_compare_means(method = "kruskal") +
labs(x = "Timepoint", y = "# good quality aligned sequences") +
scale_fill_viridis_d() +
stat_pvalue_manual(stat.test %>% mutate(p.adj = round(p.adj, 4)), label = "p.adj") +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)
ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)
rds_summary <- rds_merge %>%
group_by(database, ID_timepoint, grp) %>%
summarise(
ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
clonal_size = n(),
database,
sc_clone_grp = first(sc_clone_grp),
V_errors = mean(V_errors),
specificity_group = first(specificity_group),
cdr3_aa_length = mean(cdr3_aa_length),
v_call = first(v_call),
j_call = first(j_call)
) %>%
ungroup() %>%
group_by(database, ID_timepoint) %>%
mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
ungroup() %>%
distinct()
rds_summary_noPV <- rds_summary %>%
filter(Timepoint != "PV", Timepoint != "Single-cell")
rds_summary_save <- rds_summary %>%
filter(Timepoint %in% c("Single-cell", "B1","B2"),
database == "Individualized") %>%
group_by(ID_timepoint) %>%
summarise(mean_clonal_size = mean(clonal_size),
median_clonal_size = median(clonal_size),
geom_clonal_size = exp(mean(log(clonal_size))),
unique_clones = sum(clonal_size == 1),
total_clones_detected = n(),
percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2))
rds_summary_save %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| ID_timepoint | mean_clonal_size | median_clonal_size | geom_clonal_size | unique_clones | total_clones_detected | percentage_unique_clones |
|---|---|---|---|---|---|---|
| E11_B1 | 73.765766 | 24.0 | 23.618580 | 7 | 111 | 6.31 |
| E11_B2 | 65.842324 | 18.0 | 16.139896 | 22 | 241 | 9.13 |
| E11_Single-cell | 1.596491 | 1.0 | 1.310902 | 169 | 228 | 74.12 |
| E12_B1 | 96.802632 | 8.5 | 11.635387 | 5 | 76 | 6.58 |
| E12_B2 | 62.696296 | 9.0 | 9.733452 | 18 | 135 | 13.33 |
| E12_Single-cell | 1.556000 | 1.0 | 1.304827 | 185 | 250 | 74.00 |
| E14_B1 | 79.458333 | 9.0 | 9.683393 | 11 | 96 | 11.46 |
| E14_B2 | 111.530612 | 10.5 | 12.335661 | 29 | 196 | 14.80 |
| E14_Single-cell | 2.041096 | 1.0 | 1.527244 | 139 | 219 | 63.47 |
| E16_B1 | 73.535519 | 14.0 | 15.223779 | 14 | 183 | 7.65 |
| E16_B2 | 118.491228 | 31.5 | 29.123260 | 17 | 228 | 7.46 |
| E16_Single-cell | 1.996094 | 1.0 | 1.489967 | 169 | 256 | 66.02 |
| E17_B1 | 38.460526 | 8.0 | 9.753009 | 8 | 76 | 10.53 |
| E17_B2 | 55.614035 | 6.0 | 7.902730 | 31 | 171 | 18.13 |
| E17_Single-cell | 1.665158 | 1.0 | 1.381309 | 151 | 221 | 68.33 |
| E18_B1 | 41.323529 | 11.0 | 11.772174 | 9 | 102 | 8.82 |
| E18_B2 | 72.662921 | 21.0 | 15.643111 | 23 | 178 | 12.92 |
| E18_Single-cell | 1.641026 | 1.0 | 1.368241 | 135 | 195 | 69.23 |
| E21_B1 | 35.201835 | 5.0 | 6.887328 | 21 | 109 | 19.27 |
| E21_B2 | 45.202312 | 14.0 | 13.455144 | 16 | 173 | 9.25 |
| E21_Single-cell | 1.674208 | 1.0 | 1.385378 | 151 | 221 | 68.33 |
| E23_B1 | 59.675497 | 9.0 | 11.417886 | 19 | 151 | 12.58 |
| E23_B2 | 63.044025 | 13.0 | 14.180265 | 18 | 159 | 11.32 |
| E23_Single-cell | 1.916279 | 1.0 | 1.484490 | 136 | 215 | 63.26 |
| E24_B1 | 55.469231 | 10.0 | 11.115858 | 18 | 130 | 13.85 |
| E24_B2 | 62.310735 | 9.0 | 10.429075 | 29 | 177 | 16.38 |
| E24_Single-cell | 1.829897 | 1.0 | 1.400170 | 137 | 194 | 70.62 |
write.csv(rds_summary_save,
paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)
rds_indiv <- rds_summary %>%
filter(
database == "Individualized",
Timepoint != "Single-cell",
Timepoint != "PV"
) %>%
mutate(
LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
)
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
tidyr::expand(ID, Timepoint, grp) %>%
filter(Timepoint != "Single-cell")
rds_indiv <- rds_indiv %>%
right_join(all) %>%
mutate(
ID_timepoint = paste(ID, Timepoint, sep = "_"),
database = "individualized",
clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
) %>%
arrange(grp, ID_timepoint) %>%
tidyr::fill(LOR, Group, sc_clone_grp)
rds_indiv %>%
group_by(Group, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group)
ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)
rds_indiv %>%
group_by(Group, ID, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 30000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group + ID, ncol = 5)
df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
for (databases in unique(rds_summary_noPV$database)) {
rds_timepoint <- rds_summary_noPV %>%
filter(Timepoint == timepoints, database == databases) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
rds_indiv_corr <- rds_summary_noPV %>%
filter(
database == "Individualized",
Timepoint == timepoints
) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
length(rds_indiv_corr) <- length(rds_timepoint)
df <- data.frame(
size_indiv = rds_indiv_corr,
clonal_size = rds_timepoint,
Timepoint = timepoints,
database = databases
)
df[is.na(df)] <- 0
df_all <- rbind(df, df_all)
}
}
df_all %>%
filter(database != "Individualized") %>%
mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
ggplot(aes(x = size_indiv, y = clonal_size)) +
geom_point(size = 1) +
geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
scale_x_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
scale_y_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ Timepoint)
ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)
for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
# check if we want to filter clonotypes by size or not
filter(database == i) %>%
rbind(., within(., specificity_group <- "Total")) %>%
mutate(v_call = sub("\\*.*","", v_call),
j_call = sub("\\*.*|-.*","", j_call),
clonal_size_log = log10(clonal_size),
Group = factor(Group, levels = c("20-mer", "1-mer")),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
group_by(Group) %>%
mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
ungroup() %>%
group_by(Group, specificity_group, v_call, j_call) %>%
summarise(clonal_size_perc = sum(clonal_size_perc)) %>%
filter(specificity_group != "PostF") %>% droplevels()
plot_vj <- rds_summary_noPV %>%
ggplot(aes(x= v_call, y = j_call, fill = clonal_size_perc)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1) +
scale_y_discrete(position = "right") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
coord_equal() +
theme(axis.text.x = element_text(angle = 90, size = 5, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 5),
strip.text = element_text(face = "bold", size = 7),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top",
axis.ticks = element_line(size = .5),
legend.title = element_text()) +
labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles", title = paste0(i, " Database")) +
facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")
print(plot_vj)
ggsave(plot = plot_vj, filename = paste0("../",result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}
# groups to perform statistical comparisons
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
c("Total_B2_20-mer", "Total_B2_1-mer"),
c("PreF_B1_20-mer", "PreF_B1_1-mer"),
c("PreF_B2_20-mer", "PreF_B2_1-mer"),
c("DP_B1_20-mer", "DP_B1_1-mer"),
c("DP_B2_20-mer", "DP_B2_1-mer"))
rds_summary_noPV <- rds_summary %>%
# If want to remove clonal sizer < 1, do it here.
filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls) %>%
summarise(unique_v_j = n()) %>%
ungroup()
rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)
stat.test <- rds_summary_noPV %>%
wilcox_test(formula = unique_v_j ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unique_v_j | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 14 | 0.413 | 0.496 | ns | 137.040 | Total_B1…. | 1 | 2 |
| unique_v_j | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 11 | 0.901 | 0.901 | ns | 150.288 | Total_B2…. | 3 | 4 |
| unique_v_j | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 4 | 0.176 | 0.264 | ns | 163.536 | PreF_B1_…. | 5 | 6 |
| unique_v_j | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 0 | 0.016 | 0.048 |
|
176.784 | PreF_B2_…. | 7 | 8 |
| unique_v_j | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
190.032 | DP_B1_20…. | 9 | 10 |
| unique_v_j | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 18 | 0.064 | 0.127 | ns | 203.280 | DP_B2_20…. | 11 | 12 |
rds_summary_noPV %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)
rds_summary_noPV <- rds_summary %>%
# decide if clonal size greater than 1 should be included
filter(database == "Individualized", Timepoint != "Single-cell") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls)
list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
"1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])
ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7,
label_alpha = 0,
set_color = "black",
set_size = 9) +
scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
scale_color_manual(values = c("#5F90B0", "#D896C1")) +
theme(legend.position = "none")
ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)
Intermediate files are commented out, so it will not save all of
them. One could uncomment them if all the intermediate files are needed.
Intermediate files were used to run Recon (v2.5) (Kaplinsky
& Arnaout, Nat Commmun, 2016) according to default
parameters, more about running Recon is described in the next section Recon estimates for
RSV-specific diversity.
# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))
# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()
# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
# save full clonotype files
filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
filter(ID_timepoint_spec == animals) %>%
as.data.frame()
# write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)
# save summary files
clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
#write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# doing the same thing but for total, that means not account for specificities (PreF, DP,PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
# save summary files
clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
filter(ID_timepoint == animals) %>%
as.data.frame() %>%
mutate(
specificity_group = "Total",
ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
) %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()
for (animals in unique(clonotype_rsv$ID)) {
# save summary files
clonotype_rsv %>%
filter(ID == animals) %>%
as.data.frame() %>%
group_by(grp, timepoint, ID) %>%
summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
select(clonal_size, ID_timepoint_spec) %>%
group_by(clonal_size, ID_timepoint_spec) %>%
summarise(size = n()) %>%
ungroup()
for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
to_recon_table <- to_recon_rsv %>%
filter(ID_timepoint_spec == i) %>%
select(clonal_size, size)
# fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
Species richness (Chao1) was calculated using the vegan
r package. The samples were first subsampled 100 times for each
animal/timepoint and then the species richness was estimated. The mean
of those 100x replicates were used for plotting.
Here is the processing and estimation of species richness:
rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))
# subsample and replicate the subsampling 100 times for higher accuracy
chaox100 <- function(x, value_to_subsample) {
replicate(100, {
subsample <- vegan::rrarefy(x, value_to_subsample)
chao <- vegan::estimateR(subsample)
return(chao)
})
}
diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))
# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
print(spec_timepoint)
rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
names(x) <- gsub("_.*", "", names(x))
x
}
# adjusted dataset for plotting
{
chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
# save intermediate file
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
# diversity mean of x100 replicates per animal
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
group_by(algorithm, Timepoint_specificity) %>%
summarise_all(.funs = mean) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)
chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
))
}
Here is the plotting and comparison between vaccinated group:
chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"
chao_results_df <- chao_results_df %>%
filter(algorithm != "Chao1_se" & algorithm != "ACE_se") %>%
filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell",) %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
summarise(value = mean(value)) %>%
tidyr::drop_na() %>%
ungroup()
write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)
stat.test <- chao_results_df %>%
wilcox_test(formula = value ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.285 | ns | 272.3352 | Total_B1…. | 1 | 2 |
| value | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 15 | 0.286 | 0.343 | ns | 301.8278 | Total_B2…. | 3 | 4 |
| value | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 331.3205 | PreF_B1_…. | 5 | 6 |
| value | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 360.8131 | PreF_B2_…. | 7 | 8 |
| value | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
390.3058 | DP_B1_20…. | 9 | 10 |
| value | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
419.7984 | DP_B2_20…. | 11 | 12 |
chao_results_df %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "value",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Chao1)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)
According to Recon default settings and tutorial (check
Recon manual), the count files generated previously were used to create
the fitfiles.txt that were used as an input for generating
the diversity_table.txt for all the samples.
To generate the the fitfile for each count file, the bash script used in a for loop was:
#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE
python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt
Then, each fit file was used for generating the
diversity_table.txt with the command:
python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt
The results from diversity_table.txt for all the samples
were used as input for plotting.
recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
ungroup()
write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
stat.test <- recon_res %>%
wilcox_test(formula = est_0.0D ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| est_0.0D | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 275.7082 | Total_B1…. | 1 | 2 |
| est_0.0D | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 304.8120 | Total_B2…. | 3 | 4 |
| est_0.0D | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 333.9159 | PreF_B1_…. | 5 | 6 |
| est_0.0D | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 363.0197 | PreF_B2_…. | 7 | 8 |
| est_0.0D | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
392.1236 | DP_B1_20…. | 9 | 10 |
| est_0.0D | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
421.2274 | DP_B2_20…. | 11 | 12 |
recon_res %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Recon)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)
Here the SHM is shown for all the sequences for every animal combined. Data is divided by 20-mer, 1-mer, and specificities. T
rds_indiv_total <- rds_indiv %>%
mutate(specificity_group = "Total")
rds_indiv <- rbind(rds_indiv, rds_indiv_total)
rds_indiv_plot <- rds_indiv %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup()
stat.test <- rds_indiv_plot %>%
wilcox_test(formula = V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V_errors | Total_B1_20-mer | Total_B1_1-mer | 651 | 383 | 119282.5 | 0.246 | 0.492 | ns | 68.320 | Total_B1…. | 1 | 2 |
| V_errors | Total_B2_20-mer | Total_B2_1-mer | 976 | 682 | 349160.0 | 0.088 | 0.265 | ns | 73.504 | Total_B2…. | 3 | 4 |
| V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 205 | 205 | 21279.5 | 0.824 | 0.841 | ns | 78.688 | PreF_B1_…. | 5 | 6 |
| V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 346 | 389 | 69762.5 | 0.391 | 0.587 | ns | 83.872 | PreF_B2_…. | 7 | 8 |
| V_errors | DP_B1_20-mer | DP_B1_1-mer | 413 | 144 | 29401.0 | 0.841 | 0.841 | ns | 89.056 | DP_B1_20…. | 9 | 10 |
| V_errors | DP_B2_20-mer | DP_B2_1-mer | 583 | 250 | 79353.0 | 0.042 | 0.251 | ns | 94.240 | DP_B2_20…. | 11 | 12 |
rds_indiv_plot %>%
ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
labs(y = "# IGHV nucleotide mutations", x= "") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)
Here the SHM data is summarized per macaque, so each dot represents the average SHM of all the antigen-specific sequences for one animal.
rds_indiv_plot_summ <- rds_indiv_plot %>%
filter(Group_specificity != "PostF") %>%
group_by(ID, Group_specificity) %>%
summarise(avg_V_errors = mean(V_errors, na.rm = TRUE)) %>%
ungroup() %>%
tidyr::drop_na()
stat.test <- rds_indiv_plot_summ %>%
wilcox_test(formula = avg_V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| avg_V_errors | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 4 | 0.190 | 0.826 | ns | 20.4600 | Total_B1…. | 1 | 2 |
| avg_V_errors | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 12 | 0.730 | 0.876 | ns | 22.2204 | Total_B2…. | 3 | 4 |
| avg_V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 11 | 0.905 | 0.905 | ns | 23.9808 | PreF_B1_…. | 5 | 6 |
| avg_V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 12 | 0.730 | 0.876 | ns | 25.7412 | PreF_B2_…. | 7 | 8 |
| avg_V_errors | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 5 | 0.286 | 0.826 | ns | 27.5016 | DP_B1_20…. | 9 | 10 |
| avg_V_errors | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 14 | 0.413 | 0.826 | ns | 29.2620 | DP_B2_20…. | 11 | 12 |
rds_indiv_plot_summ %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "avg_V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
stat_summary(fun.y = mean,
geom = "crossbar") +
labs(y = "# IGHV nucleotide mutations", x= "") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 22) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
rds_indiv %>%
filter(specificity_group != "PostF") %>%
mutate(
Group_specificity = paste(Group, specificity_group, sep = "_"),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
Timepoint = factor(Timepoint, levels = c("B1", "B2"))
) %>%
ggplot(aes(x = cdr3_aa_length, fill = Group)) +
geom_density(alpha = .7) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_continuous(expand = c(0, 0)) +
scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
labs(y = "Density", x = "HCDR3 aa length") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v")
ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)
kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")
joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
## width seq names
## [1] 296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
## [2] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
## [3] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [4] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [5] 298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
## ... ... ...
## [11] 296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12] 299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13] 296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14] 299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15] 297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
type = "Not validated"
) %>%
dplyr::rename(
v_count = v_unique.Freq,
ID = v_unique.Var1
)
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
group_by(ID) %>%
arrange(ID, .by_group = TRUE) %>%
mutate(
diff = v_count - lag(v_count, default = last(v_count)),
diff = ifelse(diff < 0, v_count, diff)
)
kimdb_v_indiv %>%
write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)
kimdb_v_indiv %>%
mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
ggplot(aes(x = ID, y = diff, fill = type)) +
geom_col(color = "black") +
scale_fill_viridis_d(direction = -1) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
scale_x_discrete(expand = c(0, 0)) +
labs(y = "V alelle counts", x = "Animal ID") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)
clonal_tree_data <- clonotype_rsv %>%
ungroup() %>%
distinct(new_name, .keep_all = TRUE) %>%
group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
add_tally(name = "clonal_size") %>%
ungroup()
mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03", "LOR74" = "E11_05_B06")
# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")
# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")
sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA heavy chain
UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
names(UCA) <- paste0(mab_name,"_UCA")
# select sequences part of the same clonotype
group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clonal_tree_data %>%
filter(grp == group)
# save csv with b cell lineage lineages info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
row.names = FALSE)
}
# deduplicating sequences
fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
fasta_lc <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
#adding single cell sequences
sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
#saving duplicated
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
fasta <- c(fasta, sc_seqs, UCA)
fasta <- fasta[unique(names(fasta))]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA light chain
UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
# select light chain clonotypes
group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clono_light_chains %>%
filter(grp == group, substr(name,1,3) == substr(mab,1,3))
# save csv with b cell lineage lineages info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
row.names = FALSE)
}
# save object as BStringset
fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
# save duplicated values
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
fasta <- c(fasta, UCA_LC)
fasta <- fasta[names(fasta)]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
## [1] 3
Do a Multiple Sequence Alignment using MUSCLE (v 5.1)
for all the fasta files generated through out this analysis. Following
that, run FastTree (v 2.1.11) for all the aligned sequences
and save tree output to be plotted on the following code.
source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/
for f in *.fasta; do muscle -align $f -output $f.aln; done
for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 217 seqs, avg length 364, max 370
##
## 00:00 4.5Mb CPU has 8 cores, running 8 threads
## 00:00 4.7Mb 0.0043% Calc posteriors
00:01 181Mb 1.1% Calc posteriors
00:02 203Mb 2.1% Calc posteriors
00:03 236Mb 3.3% Calc posteriors
00:04 275Mb 4.3% Calc posteriors
00:05 296Mb 5.4% Calc posteriors
00:06 319Mb 6.4% Calc posteriors
00:07 329Mb 7.1% Calc posteriors
00:08 355Mb 8.1% Calc posteriors
00:09 376Mb 9.0% Calc posteriors
00:10 399Mb 10.0% Calc posteriors
00:11 404Mb 11.1% Calc posteriors
00:12 417Mb 12.2% Calc posteriors
00:13 438Mb 13.3% Calc posteriors
00:14 483Mb 14.3% Calc posteriors
00:15 508Mb 15.3% Calc posteriors
00:16 526Mb 16.4% Calc posteriors
00:17 565Mb 17.3% Calc posteriors
00:18 616Mb 18.3% Calc posteriors
00:19 652Mb 19.2% Calc posteriors
00:20 683Mb 20.2% Calc posteriors
00:21 692Mb 21.3% Calc posteriors
00:22 729Mb 22.3% Calc posteriors
00:23 777Mb 23.3% Calc posteriors
00:24 795Mb 24.4% Calc posteriors
00:25 824Mb 25.4% Calc posteriors
00:26 839Mb 26.4% Calc posteriors
00:27 846Mb 27.4% Calc posteriors
00:28 861Mb 28.4% Calc posteriors
00:29 868Mb 29.4% Calc posteriors
00:30 881Mb 30.4% Calc posteriors
00:31 903Mb 31.4% Calc posteriors
00:32 931Mb 32.4% Calc posteriors
00:33 943Mb 33.4% Calc posteriors
00:34 847Mb 34.4% Calc posteriors
00:35 874Mb 35.4% Calc posteriors
00:36 876Mb 36.4% Calc posteriors
00:37 884Mb 37.5% Calc posteriors
00:38 897Mb 38.5% Calc posteriors
00:39 828Mb 39.5% Calc posteriors
00:40 832Mb 40.6% Calc posteriors
00:41 845Mb 41.6% Calc posteriors
00:42 847Mb 42.7% Calc posteriors
00:43 868Mb 43.7% Calc posteriors
00:44 891Mb 44.7% Calc posteriors
00:45 911Mb 45.8% Calc posteriors
00:46 927Mb 46.8% Calc posteriors
00:47 951Mb 47.8% Calc posteriors
00:48 994Mb 48.9% Calc posteriors
00:49 1.0Gb 49.9% Calc posteriors
00:50 1.0Gb 50.9% Calc posteriors
00:51 1.0Gb 52.0% Calc posteriors
00:52 1.1Gb 53.0% Calc posteriors
00:53 1.1Gb 53.8% Calc posteriors
00:54 1.1Gb 54.5% Calc posteriors
00:55 1.1Gb 55.0% Calc posteriors
00:56 1.1Gb 55.9% Calc posteriors
00:57 1.2Gb 56.8% Calc posteriors
00:58 1.2Gb 57.8% Calc posteriors
00:59 1.1Gb 58.8% Calc posteriors
01:00 1.1Gb 59.7% Calc posteriors
01:01 1.1Gb 60.5% Calc posteriors
01:02 1.1Gb 61.2% Calc posteriors
01:03 991Mb 61.8% Calc posteriors
01:04 998Mb 62.4% Calc posteriors
01:05 1.0Gb 63.0% Calc posteriors
01:06 1.0Gb 63.6% Calc posteriors
01:07 1.0Gb 64.1% Calc posteriors
01:08 1.0Gb 64.7% Calc posteriors
01:09 1.0Gb 65.2% Calc posteriors
01:10 1.0Gb 65.8% Calc posteriors
01:11 1.0Gb 66.4% Calc posteriors
01:12 1.1Gb 66.9% Calc posteriors
01:13 1.1Gb 67.5% Calc posteriors
01:14 1.1Gb 68.1% Calc posteriors
01:15 1.1Gb 68.7% Calc posteriors
01:16 1.1Gb 69.3% Calc posteriors
01:17 1.1Gb 70.0% Calc posteriors
01:18 1.1Gb 70.7% Calc posteriors
01:19 1.1Gb 71.4% Calc posteriors
01:20 1.1Gb 72.1% Calc posteriors
01:21 1.0Gb 73.0% Calc posteriors
01:22 1.0Gb 73.8% Calc posteriors
01:23 1.1Gb 74.7% Calc posteriors
01:24 1.1Gb 75.7% Calc posteriors
01:25 1.1Gb 76.5% Calc posteriors
01:26 1.1Gb 77.5% Calc posteriors
01:27 1.1Gb 78.3% Calc posteriors
01:28 1.1Gb 79.2% Calc posteriors
01:29 1.2Gb 80.0% Calc posteriors
01:30 1.2Gb 80.8% Calc posteriors
01:31 1.2Gb 81.6% Calc posteriors
01:32 1.2Gb 82.4% Calc posteriors
01:33 1.2Gb 83.2% Calc posteriors
01:34 1.2Gb 84.0% Calc posteriors
01:35 1.2Gb 84.7% Calc posteriors
01:36 1.1Gb 85.5% Calc posteriors
01:37 1.1Gb 86.3% Calc posteriors
01:38 1.1Gb 87.0% Calc posteriors
01:39 1.1Gb 87.8% Calc posteriors
01:40 1.1Gb 88.6% Calc posteriors
01:41 1.1Gb 89.4% Calc posteriors
01:42 1.2Gb 90.3% Calc posteriors
01:43 1.2Gb 91.0% Calc posteriors
01:44 1.0Gb 91.7% Calc posteriors
01:45 1.1Gb 92.5% Calc posteriors
01:46 1.1Gb 93.3% Calc posteriors
01:47 1.1Gb 94.1% Calc posteriors
01:48 1.1Gb 94.8% Calc posteriors
01:49 1.1Gb 95.7% Calc posteriors
01:50 1.1Gb 96.4% Calc posteriors
01:51 1.1Gb 97.3% Calc posteriors
01:52 1.2Gb 98.1% Calc posteriors
01:53 1.2Gb 99.0% Calc posteriors
01:54 1.2Gb 99.8% Calc posteriors
01:54 1.2Gb 100.0% Calc posteriors
## 01:54 1.2Gb 0.0043% Consistency (1/2)
01:55 1.2Gb 7.4% Consistency (1/2)
01:56 1.2Gb 21.8% Consistency (1/2)
01:57 1.3Gb 36.8% Consistency (1/2)
01:58 1.3Gb 51.0% Consistency (1/2)
01:59 1.3Gb 65.9% Consistency (1/2)
02:00 1.3Gb 80.4% Consistency (1/2)
02:01 1.4Gb 94.9% Consistency (1/2)
02:01 1.4Gb 100.0% Consistency (1/2)
## 02:01 1.4Gb 0.0043% Consistency (2/2)
02:02 1.4Gb 4.2% Consistency (2/2)
02:03 1.4Gb 17.8% Consistency (2/2)
02:04 1.4Gb 31.7% Consistency (2/2)
02:05 1.4Gb 44.5% Consistency (2/2)
02:06 1.4Gb 55.6% Consistency (2/2)
02:07 1.4Gb 68.1% Consistency (2/2)
02:08 1.4Gb 78.5% Consistency (2/2)
02:09 1.4Gb 89.8% Consistency (2/2)
02:09 1.4Gb 100.0% Consistency (2/2)
## 02:09 1.4Gb 0.46% UPGMA5
02:09 1.4Gb 100.0% UPGMA5
## 02:10 1.4Gb 1.0% Refining
02:11 1.4Gb 20.0% Refining
02:12 1.4Gb 44.0% Refining
02:13 1.4Gb 68.0% Refining
02:14 1.4Gb 85.0% Refining
02:14 1.4Gb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 8 seqs, avg length 322, max 326
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 2.3Mb 3.6% Calc posteriors
00:01 68Mb 89.3% Calc posteriors
00:01 70Mb 100.0% Calc posteriors
## 00:01 73Mb 3.6% Consistency (1/2)
00:01 73Mb 100.0% Consistency (1/2)
## 00:01 73Mb 3.6% Consistency (2/2)
00:01 73Mb 100.0% Consistency (2/2)
## 00:01 73Mb 14.3% UPGMA5
00:01 73Mb 100.0% UPGMA5
## 00:01 73Mb 1.0% Refining
00:01 74Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 118 seqs, avg length 370, max 370
##
## 00:00 2.9Mb CPU has 8 cores, running 8 threads
## 00:00 3.1Mb 0.014% Calc posteriors
00:01 180Mb 2.9% Calc posteriors
00:02 192Mb 5.3% Calc posteriors
00:03 208Mb 7.5% Calc posteriors
00:04 211Mb 9.5% Calc posteriors
00:05 226Mb 12.0% Calc posteriors
00:06 249Mb 14.9% Calc posteriors
00:07 261Mb 17.8% Calc posteriors
00:08 279Mb 21.1% Calc posteriors
00:09 282Mb 24.3% Calc posteriors
00:10 294Mb 27.5% Calc posteriors
00:11 312Mb 30.5% Calc posteriors
00:12 323Mb 33.4% Calc posteriors
00:13 344Mb 36.3% Calc posteriors
00:14 346Mb 39.1% Calc posteriors
00:15 356Mb 41.9% Calc posteriors
00:16 368Mb 44.7% Calc posteriors
00:17 398Mb 47.4% Calc posteriors
00:18 416Mb 50.0% Calc posteriors
00:19 430Mb 52.5% Calc posteriors
00:20 442Mb 55.0% Calc posteriors
00:21 459Mb 57.6% Calc posteriors
00:22 482Mb 60.1% Calc posteriors
00:23 493Mb 62.8% Calc posteriors
00:24 517Mb 65.4% Calc posteriors
00:25 562Mb 68.1% Calc posteriors
00:26 617Mb 70.7% Calc posteriors
00:27 630Mb 72.6% Calc posteriors
00:28 660Mb 75.5% Calc posteriors
00:29 676Mb 78.4% Calc posteriors
00:30 694Mb 81.4% Calc posteriors
00:31 715Mb 84.4% Calc posteriors
00:32 740Mb 87.4% Calc posteriors
00:33 749Mb 90.2% Calc posteriors
00:34 754Mb 93.2% Calc posteriors
00:35 764Mb 96.1% Calc posteriors
00:36 790Mb 99.1% Calc posteriors
00:36 801Mb 100.0% Calc posteriors
## 00:36 801Mb 0.014% Consistency (1/2)
00:37 828Mb 53.2% Consistency (1/2)
00:37 852Mb 100.0% Consistency (1/2)
## 00:37 852Mb 0.014% Consistency (2/2)
00:38 853Mb 44.4% Consistency (2/2)
00:38 853Mb 100.0% Consistency (2/2)
## 00:38 853Mb 0.85% UPGMA5
00:38 853Mb 100.0% UPGMA5
## 00:38 854Mb 1.0% Refining
00:39 855Mb 21.0% Refining
00:39 860Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 10 seqs, avg length 331, max 334
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 2.3Mb 2.2% Calc posteriors
00:01 78Mb 68.9% Calc posteriors
00:01 86Mb 100.0% Calc posteriors
## 00:01 88Mb 2.2% Consistency (1/2)
00:01 89Mb 100.0% Consistency (1/2)
## 00:01 89Mb 2.2% Consistency (2/2)
00:01 89Mb 100.0% Consistency (2/2)
## 00:01 89Mb 11.1% UPGMA5
00:01 89Mb 100.0% UPGMA5
## 00:01 89Mb 1.0% Refining
00:01 89Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 71 seqs, avg length 370, max 370
##
## 00:00 2.4Mb CPU has 8 cores, running 8 threads
## 00:00 2.6Mb 0.04% Calc posteriors
00:01 170Mb 7.4% Calc posteriors
00:02 225Mb 15.6% Calc posteriors
00:03 247Mb 23.4% Calc posteriors
00:04 262Mb 31.7% Calc posteriors
00:05 270Mb 39.8% Calc posteriors
00:06 280Mb 46.8% Calc posteriors
00:07 301Mb 54.6% Calc posteriors
00:08 323Mb 62.8% Calc posteriors
00:09 335Mb 70.7% Calc posteriors
00:10 345Mb 78.7% Calc posteriors
00:11 364Mb 86.7% Calc posteriors
00:12 379Mb 94.2% Calc posteriors
00:12 382Mb 100.0% Calc posteriors
## 00:12 382Mb 0.04% Consistency (1/2)
00:13 393Mb 53.5% Consistency (1/2)
00:13 404Mb 100.0% Consistency (1/2)
## 00:13 404Mb 0.04% Consistency (2/2)
00:13 407Mb 100.0% Consistency (2/2)
## 00:13 407Mb 1.4% UPGMA5
00:13 407Mb 100.0% UPGMA5
## 00:13 408Mb 1.0% Refining
00:13 412Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 3 seqs, avg length 332, max 334
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 2.3Mb 33.3% Calc posteriors
00:00 2.6Mb 100.0% Calc posteriors
## 00:00 17Mb 33.3% Consistency (1/2)
00:00 17Mb 100.0% Consistency (1/2)
## 00:00 18Mb 33.3% Consistency (2/2)
00:00 18Mb 100.0% Consistency (2/2)
## 00:00 18Mb 50.0% UPGMA5
00:00 18Mb 100.0% UPGMA5
## 00:00 19Mb 1.0% Refining
00:01 19Mb 5.0% Refining
00:01 19Mb 100.0% Refining
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.07 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
## 0.13 seconds: SPR round 1 of 2, 101 of 432 nodes
## 0.27 seconds: SPR round 1 of 2, 401 of 432 nodes
## 0.38 seconds: SPR round 2 of 2, 201 of 432 nodes
## 0.48 seconds: ME NNI round 21 of 31, 1 of 215 splits
## Total branch-length 2.197 after 0.50 sec
## 0.61 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.69
## 0.74 seconds: Optimizing GTR model, step 2 of 12
## 0.85 seconds: Optimizing GTR model, step 4 of 12
## 0.98 seconds: Optimizing GTR model, step 6 of 12
## 1.09 seconds: Optimizing GTR model, step 9 of 12
## 1.19 seconds: Optimizing GTR model, step 12 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
## 1.29 seconds: Site likelihoods with rate category 3 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 1.49 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## 1.62 seconds: ML NNI round 2 of 16, 201 of 215 splits, 63 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.66
## Turning off heuristics for final round of ML NNIs (converged)
## 1.82 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
## 1.95 seconds: ML NNI round 3 of 16, 201 of 215 splits, 57 changes (max delta 1.019)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 1.98 (final)
## Optimize all lengths: LogLk = -5298.540 Time 2.04
## 2.17 seconds: ML split tests for 100 of 214 internal splits
## 2.32 seconds: ML split tests for 200 of 214 internal splits
## Total time: 2.34 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.02
## Total time: 0.03 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.03 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
## 0.12 seconds: SPR round 1 of 2, 201 of 234 nodes
## 0.24 seconds: SPR round 2 of 2, 201 of 234 nodes
## Total branch-length 1.596 after 0.28 sec
## 0.47 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.49
## 0.58 seconds: Optimizing GTR model, step 5 of 12
## 0.69 seconds: Optimizing GTR model, step 11 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
## 0.79 seconds: Site likelihoods with rate category 18 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 0.89 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 0.92
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 0.97
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 1.02
## Turning off heuristics for final round of ML NNIs (converged)
## 1.01 seconds: ML NNI round 5 of 14, 1 of 116 splits
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 1.13 (final)
## 1.12 seconds: ML Lengths 1 of 116 splits
## Optimize all lengths: LogLk = -3814.284 Time 1.15
## 1.28 seconds: ML split tests for 100 of 115 internal splits
## Total time: 1.31 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.00 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.03
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.03 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.04
## Total time: 0.05 seconds Unique: 9/10 Bad splits: 0/6
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.01 seconds
## Refining topology: 25 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
## 0.10 seconds: ME NNI round 17 of 25, 1 of 69 splits
## Total branch-length 0.623 after 0.11 sec
## ML-NNI round 1: LogLk = -2035.235 NNIs 32 max delta 1.30 Time 0.17
## 0.21 seconds: Optimizing GTR model, step 5 of 12
## GTR Frequencies: 0.2003 0.2812 0.3012 0.2172
## GTR rates(ac ag at cg ct gt) 1.3353 1.7821 0.7868 0.4971 1.3891 1.0000
## 0.31 seconds: Site likelihoods with rate category 14 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.772 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -1916.232 NNIs 8 max delta 0.00 Time 0.39
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -1916.232 NNIs 1 max delta 0.00 Time 0.45 (final)
## 0.45 seconds: ML Lengths 1 of 69 splits
## Optimize all lengths: LogLk = -1916.232 Time 0.47
## Total time: 0.57 seconds Unique: 71/71 Bad splits: 0/68
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR74_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 6 rounds ME-NNIs, 2 rounds ME-SPRs, 3 rounds ML-NNIs
## Total branch-length 0.053 after 0.00 sec
## ML-NNI round 1: LogLk = -560.888 NNIs 0 max delta 0.00 Time 0.00
## GTR Frequencies: 0.1960 0.3050 0.2620 0.2370
## GTR rates(ac ag at cg ct gt) 3.2025 7.6580 2.7938 2.4124 1.8012 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.635 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -548.791 NNIs 0 max delta 0.00 Time 0.00 (final)
## Optimize all lengths: LogLk = -548.791 Time 0.00
## Total time: 0.01 seconds Unique: 3/3 Bad splits: 0/0
Generated trees arre edited using treeio and plotted
with ggtree. The trees were rerooted to their respectives
Unmutated Common Ancestor (UCA), which in this case is defined as just
the V and J gene germlines combined. The gap between V-J is inserted
automatically by the alignment method, thus the CDR3 here is not
considered for the UCA.
# function to root tree in UCA
.to_root_uca <- function(x){
root(x,which(grepl("UCA",x[["tip.label"]])))
}
ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)
names(ls) <- lapply(ls, function(x) {
if (stringr::str_count(x, "LOR") > 1) {
substr(x, 24, 34)
}else if (grepl("LC",x)) {
substr(x, 24, 31)
}else if (grepl("LOR",x)) {
substr(x, 24, 28)
}else{
substr(x, 24,33)
}})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)
for(i in seq_along(plots)){
print(i)
plot_name <- names(plots)[i]
plots_edit <- plots[[i]]$data %>%
mutate(new_label = ifelse(grepl("sc_|LOR",label), gsub("sc_", "",label), ""),
new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
warn_missing = FALSE),
new_label = ifelse(grepl("LOR",new_label), new_label, ""),
name = label,
timepoint = case_when(grepl("B2-igg",name) ~ "B2",
grepl("B1-igg",name) ~ "B1",
grepl("sc_|LOR",name) ~ "single_cell",
grepl("igm",name) ~ "PV",
TRUE ~ "intersects"),
name = gsub("sc_", "", label))
if(stringr::str_count(plot_name, "LOR") > 1){
plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name")
}else{
plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name")
}
# shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")
gg_plot <- ggtree(plots_edit,aes(color = timepoint)) +
{if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
else geom_tippoint(aes(size = duplicated), shape = 18)}+
# geom_tippoint(aes(size = duplicated), shape = 23) +
geom_tiplab(aes(label = new_label), hjust = -.2) +
labs(size = "Count", shape = "Shape", color = "Timepoint") +
scale_color_manual(values = colors_timepoints) +
coord_cartesian(clip = 'off') +
# scale_shape_manual(values = shapes_timepoints) +
scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
geom_treescale(width = .05)
print(gg_plot)
ggsave(gg_plot, filename = paste0("../", result.dir, names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 3, 4), height = ifelse(grepl("LC", plot_name), 3, 6))
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
Query LOR24 amino acid sequence in bulk repertoire sequencing from each animal. Query was done without alignment due to large number of sequences, but string distance was calculated using Levenstein distance.
The entire dataset was loaded, filtered and used to calculate
distances to LOR24. The output was generated and saved in
.rds format to avoid recaculation for each run.
processed_data <- readRDS("../data/clonotypes/individualized/processed_clonotypes.rds")
processed_data <- processed_data %>% select(grp, name, ID, timepoint, new_name, name, V_SHM, VDJ_aa)
LOR24aa <- "QLQLQESGPGLVKSSETLPLTCAVSGDSISSSYWSWIRQAPGKGLEWIGYIYGSGSYSHYNPSLKSRVTLSVDTSKNQFFLRLNSVTVADTAVYYCARGGRGNTYSWNRFDVWGPGVLVTVSS"
# calculate string distance between LOR24 and all the sequences
identity_matrix <- stringdist::stringdist(gsub("\\*", "",processed_data$VDJ_aa), LOR24aa, method = "lv", nthread = 8)
# calculate the percentage of identity, normalize by LOR24 length
processed_data <- processed_data %>%
mutate(identity = (1-identity_matrix/nchar(LOR24aa)) * 100) %>%
select(V_SHM, identity, ID)
saveRDS(processed_data, "../data/clonotypes/individualized/lor24_identity_calculation.rds")
The processed dataset above was used as an input to plot a 2d-histogram. This plot includes the somatic hypermutation percentage vs the identity to LOR24 amino acid sequences. In essence, this plot shows how a set of sequences across different non-human primates were evolving to become more identical to LOR24.
processed_data <- readRDS("../data/clonotypes/individualized/lor24_identity_calculation.rds")
# plot a 2d-histogram of mutations and identity to LOR24
processed_data %>%
filter(identity >= 70) %>%
ggplot(aes(x = V_SHM, y = identity)) +
geom_bin2d(bins = 30) +
scale_fill_viridis_c(trans = "log10") +
theme_prism() +
# scale_y_continuous(expand=c(0,1),limits=c(0,101)) +
labs(x = "% Divergence from germline", y = paste0("% Identity to LOR24aa"),
fill = "Sequence counts\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12)) +
facet_wrap(~ID)
ggsave(filename = paste0("../", result.dir,"LOR24_2d-histogram.pdf"))
LOR mAbs had a diverse competition profiles against different standardized mAbs while binding to PostF and PreF RSV fusion proteins.
data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
data_comp_auc$IC50_RSV_A2 <- log10(data_comp_auc$IC50_RSV_A2)
# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity", "IC50_RSV_A2")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
"PreF/PostF" = "#CD87F8",
"PostF" = "#92CDD6",
"w.b." = "grey20"), epitope = c(
"Ø" = "#F3B084",
"V" = "salmon",
"III" = "#A9D08D",
"IV" = "#FAD0FF",
"II" = "#FFD966",
"I" = "#9BC1E6",
"UK-Pre" = "#C9C9C9",
"UK-DP" = "#8497B0" ,
"WB" = "grey95",
"PostF" = "black",
"foldon" = "grey40"))
{
g_heatmap_scale <- data_comp_auc %>%
select(-all_of(to_remove)) %>%
mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
t() %>%
pheatmap::pheatmap(scale = "none", angle_col = 90, cutree_cols = 8,
clustering_method = "ward.D",
color = viridisLite::viridis(100),
cluster_rows = FALSE, border_color = "grey40",
cluster_cols = TRUE,
legend_breaks = c(0, 0.5, 1, 1.5, max(.)),
legend_labels = c("0", "0.5", "1", "1.5", "Normalized\ncompetition\n"),
annotation_col = data_comp_auc[,13:14],
annotation_colors = annot_colors,
cellwidth = 5, cellheight = 5, fontsize_row = 5, fontsize_col = 6, fontsize = 6)
ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 18, height = 12, units = "cm")
}
This MDS is another way to visualize the same data showed on the heatmap above. Here, the MDS input is the euclidean distance. The data was scaled and centered prior calculating their euclidean distance, which is the most used and simplest distance calculation.
# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c(paste0("LOR", sprintf(fmt = "%02d",seq(1:106)))))
data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name")
to_remove <- c("epitope", "specificity", "IC50_RSV_A2")
# compute MDS
mds_scaled <- data_comp_auc %>%
select(-all_of(c(to_remove, "name"))) %>%
scale(center = TRUE, scale = TRUE) %>%
dist(method = "euclidean") %>%
cmdscale() %>%
as_tibble()
colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
p1 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_manual(values = annot_colors[["epitope"]]) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1)
plotly::ggplotly(p1)
ggsave(plot = p1, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-sites.pdf"))
p2 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = IC50_RSV_A2)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_viridis_c() +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
labs(color = "IC50 RSV\n(log10)")+
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1)
plotly::ggplotly(p2)
ggsave(plot = p2, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-neuts.pdf"))
mabs_lors <- mabs_lors %>%
filter(name %in% lor_mabs$well_ID) %>%
mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
arrange(mAbs_ID) %>%
relocate(mAbs_ID)
mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp),
mabs_lors[mabs_lors$name == i,]$mAbs_ID,
NA)
}
col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])
mabs_clonal_rank <- mabs_clonal_rank %>%
mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
arrange(LOR) %>%
mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
filter(ID == well_ID)
mabs_lors <- mabs_lors %>%
mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)
# Fixed manually LOR24 (same as LOR19), LOR37 (not detected B2, clonal size sc = 1), LOR40 (not detected B2, clonal size sc = 1 ), LOR42 (same clone as LOR01), LOR94 (same clone as LOR01), LOR96 (not detected B2, clonal size sc = 4)
#mabs_lors %>% write.csv(paste0("../data/single_cell/LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(file = paste0("../data/single_cell/LOR_mAbs_info.csv"), sep = ",")
mabs_lors %>%
left_join(data_comp_auc %>% tibble::rowid_to_column("mAbs_ID") %>% mutate(mAbs_ID = as.character(name)), by = "mAbs_ID") %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = epitope)) +
geom_point(size = 3) +
scale_color_manual(values = annot_colors[["epitope"]]) +
scale_y_log10() +
scale_x_log10() +
labs(fill = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
merged_mabs_lors <- mabs_lors %>%
left_join(clonotype_rsv, by = "name") %>%
left_join(clono_light_chains,suffix = c("_HC","_LC"), by = "name") %>%
mutate(v_call_HC = V_gene.x) %>%
fill(v_call_LC)
write.csv(merged_mabs_lors, file = "../data/single_cell/LOR_mAbs_info_full.csv", row.names = F)
merged_mabs_lors %>%
group_by(v_call_HC, v_call_LC) %>%
mutate(mabs_counts = n()) %>%
ggplot(aes(x= v_call_HC, y = v_call_LC, fill = mabs_counts)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1) +
scale_y_discrete(position = "right") +
labs(fill = "mAb counts", x = "IGHV alleles", y = "IGHJ alleles") +
coord_equal() +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.text.x = element_text(angle = 90, size = 6, hjust = 1,
vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 6),
strip.text = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "right",
axis.ticks = element_line(size = .5),
legend.title = element_text())
renv::snapshot()
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.3.4 dplyr_1.1.0 ggprism_1.0.4
## [4] scales_1.2.1 vegan_2.6-4 lattice_0.20-45
## [7] permute_0.9-7 data.table_1.14.8 Biostrings_2.66.0
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0 IRanges_2.32.0
## [13] S4Vectors_0.36.0 BiocGenerics_0.44.0 ggpubr_0.6.0
## [16] ggplot2_3.4.1 rstatix_0.7.2 ggtree_3.6.0
## [19] treeio_1.22.0 tidyr_1.3.0 jsonlite_1.8.4
##
## loaded via a namespace (and not attached):
## [1] ggVennDiagram_1.2.2 colorspace_2.1-0 ggsignif_0.6.4
## [4] ellipsis_0.3.2 class_7.3-21 aplot_0.1.10
## [7] rstudioapi_0.14 proxy_0.4-27 farver_2.1.1
## [10] ggrepel_0.9.3 fansi_1.0.4 xml2_1.3.3
## [13] splines_4.2.2 cachem_1.0.7 knitr_1.42
## [16] broom_1.0.4 cluster_2.1.4 pheatmap_1.0.12
## [19] compiler_4.2.2 httr_1.4.5 backports_1.4.1
## [22] Matrix_1.5-3 fastmap_1.1.1 lazyeval_0.2.2
## [25] cli_3.6.0 htmltools_0.5.4 tools_4.2.2
## [28] gtable_0.3.1 glue_1.6.2 GenomeInfoDbData_1.2.9
## [31] Rcpp_1.0.10 carData_3.0-5 jquerylib_0.1.4
## [34] vctrs_0.5.2 ape_5.7 svglite_2.1.1
## [37] nlme_3.1-162 crosstalk_1.2.0 xfun_0.37
## [40] stringr_1.5.0 rvest_1.0.3 lifecycle_1.0.3
## [43] zlibbioc_1.44.0 MASS_7.3-58.3 parallel_4.2.2
## [46] RColorBrewer_1.1-3 yaml_2.3.7 gridExtra_2.3
## [49] ggfun_0.0.9 yulab.utils_0.0.6 sass_0.4.5
## [52] stringi_1.7.12 highr_0.10 tidytree_0.4.2
## [55] e1071_1.7-13 rlang_1.0.6 pkgconfig_2.0.3
## [58] systemfonts_1.0.4 bitops_1.0-7 evaluate_0.20
## [61] purrr_1.0.1 sf_1.0-7 patchwork_1.1.2
## [64] htmlwidgets_1.6.1 labeling_0.4.2 cowplot_1.1.1
## [67] tidyselect_1.2.0 plyr_1.8.8 magrittr_2.0.3
## [70] R6_2.5.1 generics_0.1.3 DBI_1.1.3
## [73] pillar_1.8.1 withr_2.5.0 mgcv_1.8-42
## [76] units_0.8-1 abind_1.4-5 RCurl_1.98-1.10
## [79] tibble_3.2.0 crayon_1.5.2 car_3.1-1
## [82] KernSmooth_2.23-20 utf8_1.2.3 plotly_4.10.1
## [85] RVenn_1.1.0 rmarkdown_2.20 grid_4.2.2
## [88] digest_0.6.31 classInt_0.4-9 webshot_0.5.4
## [91] gridGraphics_0.5-1 munsell_0.5.0 viridisLite_0.4.1
## [94] ggplotify_0.1.0 bslib_0.4.2